Imagine that we are some fancy data scientists exploring - once again - the Gapminder data. We are particularly interested in the development of the GDP across time and across countries. Some R-fanatics from GESIS suggested that we use this tidyverse thing to complete our tasks. They also told us that we do not always need to load all of its packages at once.
Load the tidyverse packages for importing Excel data and for data wrangling.
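A minimal sketch of loading only the packages needed here. The text does not name them, so the selection below is an assumption: readxl for the Excel import, dplyr and tidyr for the wrangling steps.

```r
# Load individual tidyverse packages instead of library(tidyverse)
library(readxl)  # importing Excel files
library(dplyr)   # wrangling verbs such as select(), filter(), rename()
library(tidyr)   # reshaping between wide and long format, e.g. gather()
```

Note that readxl belongs to the tidyverse but is not attached by library(tidyverse), so it has to be loaded explicitly in any case.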
Ok, that wasn’t too hard. But data science is about data, so we have to load the data we are interested in.
sheet = "name_of_your_sheet"
Have the data been successfully imported? They should be in a tibble with the dimensions 275 x 53. As a further check: The income per person for Algeria for the years 1960, 1961, and 1962 should be 1280, 1085, and 856.
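The import and the dimension check could be sketched as follows. The file name is hypothetical and the sheet name is the placeholder from the hint, so both need to be replaced with your own values.

```r
library(readxl)

# Hypothetical file name; replace it with the path to your Gapminder Excel file.
gdp_path <- "gapminder_gdp.xlsx"

if (file.exists(gdp_path)) {
  # The sheet name is a placeholder, not the real name of the sheet.
  gap_gdp <- read_excel(gdp_path, sheet = "name_of_your_sheet")

  # read_excel() returns a tibble; the text expects 275 rows and 53 columns.
  print(dim(gap_gdp))
} else {
  message("File not found: ", gdp_path)
}
```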
Check this by selecting columns with select() and filtering rows by number with slice().
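As an illustration of how select() and slice() work together, here is a sketch on a small toy tibble. The column name `country` and the second row are invented; only the Algeria values for 1960 to 1962 come from the text.

```r
library(dplyr)

# Toy stand-in for the imported wide data. Only the Algeria values for
# 1960 to 1962 are taken from the text; all other entries are invented.
gap_wide <- tibble(
  country = c("Algeria", "Examplestan"),
  `1960` = c(1280, 480),
  `1961` = c(1085, 500),
  `1962` = c(856, 520),
  `1963` = c(NA, 540)
)

# Keep the country column plus the years 1960 to 1962, then take row 1.
algeria_check <- gap_wide %>%
  select(country, `1960`:`1962`) %>%
  slice(1)

algeria_check
```

In the real data, the row number passed to slice() depends on where Algeria appears in the sheet, so it would have to be looked up first.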
Let’s say that we are interested in the earliest 10 years as well as the most recent 10 years that appear in the dataset. If we want to aggregate the data per year, they should ideally be in long format.
Transform the data into long format with gather(). Additionally, you might want to create a more convenient column name for the variable Income per person (fixed 2000 US$) with rename() (GDP might be a good choice here) and change the variable type for year to integer.
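One way this reshaping step might look, sketched on toy data. The column name `country` and the second row are assumptions made for the example; the real column layout may differ.

```r
library(dplyr)
library(tidyr)

# Toy wide data; the `country` column name and the second row are invented.
gap_wide <- tibble(
  country = c("Algeria", "Examplestan"),
  `1960` = c(1280, NA),
  `1961` = c(1085, 500),
  `1962` = c(856, 520)
)

gap_long <- gap_wide %>%
  # Stack all year columns into a key column (year) and a value column.
  gather(key = "year", value = "Income per person (fixed 2000 US$)", -country) %>%
  # Give the value column a shorter name.
  rename(GDP = `Income per person (fixed 2000 US$)`) %>%
  # The former column names are character strings; convert them to integer.
  mutate(year = as.integer(year))

gap_long
```

gather() still works but is superseded in current tidyr versions; pivot_longer() is the recommended replacement if you prefer the newer interface.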
There are still a lot of missing values we might want to get rid of, and the data are not arranged in a way that is ideal to explore changes over time. For the next tasks, simply re-use the previous code and add the following commands with %>%.
Remove the missing values with filter() in combination with !is.na().
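A sketch of dropping missing values and reordering the rows, again on invented toy data. Sorting by year and then by country with arrange() is an assumption about what a convenient ordering for exploring changes over time could be.

```r
library(dplyr)

# Toy long-format data with one missing GDP value (rows partly invented).
gap_long <- tibble(
  country = c("Algeria", "Examplestan", "Algeria", "Examplestan"),
  year = c(1961L, 1960L, 1960L, 1961L),
  GDP = c(1085, NA, 1280, 500)
)

gap_clean <- gap_long %>%
  filter(!is.na(GDP)) %>%  # keep only rows that have a GDP value
  arrange(year, country)   # assumed ordering: by year, then by country

gap_clean
```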
Now we have a - more or less - clean dataset for our actual task: calculating the mean values across all countries for each of the first ten years and each of the last ten years. What’s still a little bit distracting is that we have the values for all years between these two periods in the data. However, we might want to use some of these data points in future analyses. Hence, we will do all analyses ‘on the fly’ (i.e., without creating a new dataset). Let’s start with the first period.
Calculate the mean GDP across all countries for each of the first ten years in the dataset.
Since year is now an integer, you can simply filter the range of years you are interested in. The first year in the dataset is 1960.
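The yearly means for the first period could be computed roughly like this. The toy data (two invented countries with invented GDP values) only stand in for the cleaned long-format data, and the column name `mean_GDP` is a choice made for this example.

```r
library(dplyr)

# Toy long-format data: two invented countries, years 1960 to 1972.
gap_long <- tibble(
  country = rep(c("A", "B"), each = 13),
  year = rep(1960:1972, times = 2),
  GDP = c(seq(100, 220, by = 10), seq(300, 540, by = 20))
)

first_ten <- gap_long %>%
  filter(year >= 1960, year <= 1969) %>%  # first ten years, starting in 1960
  group_by(year) %>%                      # one group per year
  summarise(mean_GDP = mean(GDP))         # mean across all countries

first_ten
```

Because nothing is assigned back to `gap_long`, the full dataset stays available for later analyses, which matches the 'on the fly' approach described above.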
Now it should be easy to do the same for the 10 most recent years in the dataset…
Calculate the mean GDP across all countries for each of the last ten years.
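A possible sketch for the most recent period. Since the text does not state the last year in the data, the example derives it with max(year) instead of hard-coding it. The toy data and the column name `mean_GDP` are invented for illustration.

```r
library(dplyr)

# Toy long-format data: two invented countries, years 1960 to 1972.
gap_long <- tibble(
  country = rep(c("A", "B"), each = 13),
  year = rep(1960:1972, times = 2),
  GDP = c(seq(100, 220, by = 10), seq(300, 540, by = 20))
)

last_ten <- gap_long %>%
  filter(year > max(year) - 10) %>%  # the ten most recent years in the data
  group_by(year) %>%                 # one group per year
  summarise(mean_GDP = mean(GDP))    # mean across all countries

last_ten
```

mean() returns NA if a group contains missing values, so this assumes the rows without a GDP value have already been removed.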